R Markdown

## put your code here
library(knitr)
library(readr)
library(dplyr)
library(tidyr)
library(stringr)
library(ggplot2)
library(broom)

This project is based on real complany data to analyze the relationships of remote session usage with date time and geolocation of Virtual Machine(VM) in the data center and remote clients. The company name remains anonymous in the project. Two data files: Remote session session.csv.gz in GitHub and Region region.csv.gz in GitHub.

Project Background

The company sells and leases its storage products to customers globally. The audit agent installed on customer’s network allows secure transfer of the error and diagnostic information to be sent back to the Customer Support Center(remote session). Engineers then analyze these error reports and trouble shoot for customers.

The company’s system has been tracking when, where(with IP address), how much and who data transfer comes from. However there is no data analytic tool to analyze these information. The project intends to use R to do statistical computing, depict the remote session allocations graphically, identity capacity shortages, over allocated VMs, hotspot with different geo locations and date time.

With recent three months of remote session data(over one million records) from 1/31/2016 to 4/30/2016, the goal is to find out whether there is a trend that the number of allocated VM sessions, session duration, data transfer has correlation with geolocation and date time.

Check whether normal distribution is followed and whether linear model can be applied. Here are some key benefits:

  1. Investigate the remote session trends with geolocation and date time
  2. Identify potential over allocated VMs from geo location and date time in order to take preventive actions.
  3. Use the model to predict the future VM capacity.

The backbone data comes from the company’s database, which tracks remote sessions’ start/end time, installed agent’s IP address, the VM name located in different data center, how much data has been transferred and the session termination cause. The data is queried via SQL and extracted into CSV fommat(session.csv). A Java program is developed to call IP lookup web service and extract the regional information into CSV format(region.csv)

Data Overview

First load the session file(session.csv.gz), which is extracted from production DB and zipped.

There are over 1 million remote sessions in the past three months. Each row is correspond to each remote session initiated globally.

Here is a description of the columns:

Session_ID: Unique identifier of each remote session

Login: Account name to initiate a remote session to access VM in the data center

Start_Time: The time stamp when a remote session starts

End_Time: The time stamp when a remote session ends

Agent_IP: IP address of the agent installed on the product

Termination_Cause: a lookup value of different types of termination. A normal termination cause value is 3.

Data_Transfer: the file transfer from the agent(client) to the VM(Server) in byte

Gas_Server_Name: One of the fifteen VMs located in five data centers(Cork in Europe, Hopkinton, Durham, Santa Clara in the US and Singapore in Asia). Each data center has three VMs.

Duration: The time span between Start_Time and End_Time in minutes

WeekofYear: The week number since January 1, 2016. For example, January 8, 2016 will be 2.

Dateofyear: The date number since January 1, 2016.For example, January 8, 2016 will be 8.

Weekday: Day of the week. For example: SUN, MON.

## put your code here

session <- read_csv(gzfile("~/Downloads/session.csv.gz"))

login_info<-session %>% group_by(LOGIN) %>% 
  summarize(login_count=n()) 

count(session)
## Source: local data frame [1 x 1]
## 
##         n
##     (int)
## 1 1181760
head(login_info[order(-login_info$login_count),], n=1 )
## Source: local data frame [1 x 2]
## 
##   LOGIN login_count
##   (chr)       (int)
## 1   syr      304708
(head(login_info[order(-login_info$login_count),], n=1 )$login_count)*100 / (count(session))
##          n
## 1 25.78425

The highest number of remote sessions was initiated by a generic account: syr, which stands for 25% of the total login.

Based on the business knowledge, SYR initiates data separately as automated(remote scripting) sessions and it consists of 25% of the total. Therefore filter out this part to better represent the customer experience.

Do data manipulation to separate the start time to year, month, day, hour, minute and second.

## put your code here

p<-session %>% filter((LOGIN!="syr") & (START_TIME!="null")) 

count(p)
## Source: local data frame [1 x 1]
## 
##        n
##    (int)
## 1 877052
p$SDATE <- as.Date(p$START_TIME,"%m/%d/%Y") 

p<-p %>%  separate(SDATE, into=c("SYEAR", "SMON", "SDATE"),sep="-")

p$STIME<-format(as.POSIXct(strptime(p$START_TIME, "%m/%d/%Y %H:%M:%S")), format="%H:%M:%S")
p<-p%>% separate(STIME, into=c("SHOUR", "SMIN", "SSECOND"),sep=":")

Each data center has three VMs named from 01 to 03. Add a DC column mapped from GAS_Server_Name column.

## put your code here
DC <-c(esrgckprd01="Cork",esrgckprd02="Cork",esrgckprd03="Cork",esrghoprd01="Hopkinton", esrghoprd02="Hopkinton", esrghoprd03="Hopkinton", esrgscprd01="Santa Clara", esrgscprd02="Santa Clara", esrgscprd03="Santa Clara", esrgweprd01="Durham", esrgweprd02="Durham",esrgweprd03="Durham", esrgspprd01="Singapore", esrgspprd02="Singapore",esrgspprd03="Singapore")

p$DC <-DC[p$GAS_SERVER_NAME]

Termination cause is an indicator of the connection between the client and the server. Need to see its distribution.

## put your code here
TERMINATION_CODE<-c(0,1,2,3,4,5,6,7,8,9,10,11)
TERMINATION<-c("Not Set","User Shutdown", "Device Shutdown","Abondoned", "User Inactive", "Device Inactive", "System Startup", "No Servers", "No Bridges", "Session Denied", "Failed Merge", "Not Started")
df<-data.frame(cbind(TERMINATION_CODE,TERMINATION),stringsAsFactors=FALSE)
df$TERMINATION_CODE<-as.numeric(df$TERMINATION_CODE)
p<-left_join(p, df, by=c("TERMINATION_CAUSE"="TERMINATION_CODE"))


t<-p %>% group_by(TERMINATION) %>% 
  summarize(login_count=n())

arrange(t,desc(login_count))
## Source: local data frame [8 x 2]
## 
##       TERMINATION login_count
##             (chr)       (int)
## 1       Abondoned      672383
## 2 Device Shutdown      125837
## 3     Not Started       39503
## 4  System Startup       30317
## 5   User Shutdown        6480
## 6   User Inactive        2228
## 7         Not Set         269
## 8 Device Inactive          35
p %>% group_by(TERMINATION) %>% 
  summarize(login_count=n()) %>% 
   mutate(TERMINATION=reorder(TERMINATION,login_count)) %>%
  ggplot(aes(TERMINATION, login_count, fill=TERMINATION)) +geom_bar(stat="identity")+
geom_hline(aes( yintercept = mean(login_count)))+
xlab("Termination Cause")+ylab("The number of remote sessions") +labs(title="Termination Cause Distribution")

From the graph, majority of the termination cause is “Abandoned”, which should be used to assess normal data transactions.

Now consider the geolocation of the client, which is determined by installed agent’s IP address.

Load region.csv.gz, which is generated by a Java program calling IP lookup web service for each IP address.

Then left join session to region so that each row contains both client and server information.

## put your code here
region <- read_csv(gzfile("~/Downloads/region.csv.gz"))

p<-left_join(p, region, by=c("AGENT_IP"="IP"))

Data Visualization

Now observe the trend of between the number of remote sessions and the dayofyear from 1/31/2016 to 4/30/2016. Use point size to denote the average data transfer size in MB.

## put your code here
p %>%group_by(DAYOFYEAR) %>% 
  summarize(login_count=n(),DATA_TRANSFER=mean(DATA_TRANSFER/1000000)) %>% 
  ggplot(aes(DAYOFYEAR, login_count, group=1)) +
  geom_point(aes( size=DATA_TRANSFER)) + geom_smooth(color="red")+
xlab("Day of Year from 1/31/2016 to 4/30/2016")+ylab("The number of remote sessions") +labs(title="Remote sessions on all VMs")

As a whole, the trend of normal remote session was gradually going down with day in the past three months.

Compared to weekdays, the number of remote session over the weekends is much lower.

Next check what the trend during 24 hour cycle is:

## put your code here
p %>% group_by(SHOUR) %>% 
  summarize(login_count=n(),DATA_TRANSFER=mean(DATA_TRANSFER/1000000)) %>% 
  ggplot(aes(SHOUR, login_count, group=1)) +
  geom_point(aes( size=DATA_TRANSFER)) + geom_smooth(color="red")+
xlab("24 hour in EST")+ylab("The number of remote sessions") +labs(title="Remote sessions around the hour")

It is observed that during 24 hour cycle, more emote sessions were initiated in the morning of EST. More data transfers happened between 8 PM towards midnight EST.

Data Analytic with geolocation of data centers

As remote sessions accessed one of the five data centers, which data center took the most loading in number of remote sessions?

## put your code here
p %>% filter(TERMINATION_CAUSE==3) %>% group_by(DC) %>% 
  summarize(login_count=n()) %>% 
  ggplot(aes(DC, login_count, fill=DC)) +geom_bar(stat="identity")+
  geom_hline(aes( yintercept = mean(login_count)))+
xlab("Data Centers")+ylab("The number ofremote sessions") +labs(title="Normal remote sessions across Data Center")

p %>% filter(TERMINATION_CAUSE!=3) %>% group_by(DC) %>% 
  summarize(login_count=n()) %>% 
  ggplot(aes(DC, login_count, fill=DC)) +geom_bar(stat="identity")+
  geom_hline(aes( yintercept = mean(login_count)))+
xlab("Data Centers")+ylab("The number of remote sessions") +labs(title="Abnormal remote sessions across Data Center")

It turns out the data center in Cork took most abnormal remote sessions((termination cause !=3)) as well as total remote sessions while Cork, Hopkinton and Durham data centers almost has the same normal remote sessions(termination cause ==3). Then what about the average duration considering normal termination and average data transferred across data centers?

## put your code here
p %>% filter(!is.na(DURATION) & (TERMINATION_CAUSE==3)) %>% group_by(DC) %>% summarize(DURATION=mean(as.numeric(DURATION))) %>% 
  ggplot(aes(DC, DURATION, fill=DC)) +geom_bar(stat="identity")+
   geom_hline(aes( yintercept = mean(DURATION)))+
xlab("Data Centers")+ylab("Average remote session durtaion in minutes") +labs(title="Average remote session durtaion across Data Center")

p %>% filter(!is.na(DATA_TRANSFER)&  (TERMINATION_CAUSE==3)) %>% group_by(DC) %>%  summarize(DATA_TRANSFER=mean(DATA_TRANSFER/1000000)) %>% 
  ggplot(aes(DC, DATA_TRANSFER, fill=DC)) +geom_bar(stat="identity")+
  geom_hline(aes( yintercept = mean(DATA_TRANSFER)))+
xlab("Data Centers")+ylab("Average data transfer in MBs") +labs(title="Average data transfer across Data Center")

It appears the duration was evenly distributed across data centers while Singapore data center received the least data transfer.

The next step is to concentrate on the number of remote session across VMs in all data centers.

## put your code here
p %>% group_by(GAS_SERVER_NAME) %>% 
  summarize(login_count=n()) %>% 
  ggplot(aes(GAS_SERVER_NAME, login_count, fill=GAS_SERVER_NAME)) +geom_bar(stat="identity")+
geom_hline(aes( yintercept = mean(login_count)))+
xlab("VMs across Data Centers")+ylab("The number of remote sessions") +labs(title="Remote sessions across VMs")

p %>% group_by(DAYOFYEAR,DC) %>% 
  summarize(login_count=n()) %>% 
  ggplot(aes(x = DAYOFYEAR, y = login_count, color = DC)) + geom_line()+xlab("Day of Year from 1/31/2016 to 4/30/2016")+ylab("The number of remote sessions") +labs(title="The trend of remote sessions for data centers")

The result is consistent with the previous observation: Cork VMs were taking the most loading of remote sessions.

Data Analytic with client geolocation

Now look into the data based on the client country.

## put your code here

login_info<-p %>% group_by(Country) %>% 
  summarize(login_count=n()) 

head(login_info[order(-login_info$login_count),], n=10 )
## Source: local data frame [10 x 2]
## 
##    Country login_count
##      (chr)       (int)
## 1       US      424440
## 2       DE       50434
## 3       GB       34473
## 4       FR       31029
## 5       JP       26221
## 6       ES       25260
## 7       IT       23501
## 8       BR       23005
## 9       AU       20785
## 10      SE       16970
login_info<-p %>% filter(Country=="US")%>% group_by(Region) %>% 
  summarize(login_count=n()) 

head(login_info[order(-login_info$login_count),], n=5 )
## Source: local data frame [5 x 2]
## 
##    Region login_count
##     (chr)       (int)
## 1 Unknown       51011
## 2      NY       50577
## 3      CA       36433
## 4      TX       29390
## 5      NJ       28019

The number of remote sessions requesting from the US is way higher than that from other countries. Within the US, NY state ranks first. Unknown region, ranking in the second place, means the IP address is internal or can not be determined.

## put your code here
p %>% group_by(Country) %>% 
  summarize(login_count=n())   %>% filter(login_count>10000) %>%
  mutate(Country=reorder(Country,login_count)) %>%
  ggplot(aes(Country, login_count, fill=Country)) +geom_bar(stat="identity")+xlab("Countries with more than 10,000 remote sessions from 1/31/216 to 4/30/2016")+ylab("The number of remote sessions") +labs(title="Remote session distribution")

Based on the graph, the number of remote sessions coming from US was way ahead of other countries.

Based on the previous observation, the remote sessions to Cork data center were loaded most. It appears a number of US remote sessions routed to Cork Data Center even though there are three US data centers, which is inefficient.

## put your code here
p %>% filter(DC=="Cork") %>%group_by(Country) %>% 
  summarize(login_count=n())   %>% filter(login_count>10000) %>%
  mutate(Country=reorder(Country,login_count)) %>%
  ggplot(aes(Country, login_count, fill=Country)) +geom_bar(stat="identity")+xlab("Countries with more than 10,000 remote sessions from 1/31/216 to 4/30/2016")+ylab("The number of remote sessions") +labs(title="Remote session distribution in Cork data center")

p %>% filter(DC=="Cork" & TERMINATION_CAUSE!=3) %>%group_by(Country) %>% 
  summarize(login_count=n())   %>% filter(login_count>5000) %>%
  mutate(Country=reorder(Country,login_count)) %>%
  ggplot(aes(Country, login_count, fill=Country)) +geom_bar(stat="identity")+xlab("Countries with more than 5,000 remote sessions from 1/31/216 to 4/30/2016")+ylab("The number of remote sessions") +labs(title="Abnormal remote session distribution in Cork data center")

The first graph above confirms: the number of remote sessions requesting from the US ranks the second among European countries for Cork data center.

Because it shows previously that Cork data center takes the most abnormal termination(termination_cause!=3), the second graph above further shows the requests from Germany ranks the first in abnormal termination of the remote session routing to Cork data center.

Take another angle, of all the remote sessions coming from US, what about their distribution of the data centers?

## put your code here
p %>% filter(Country=="US") %>%group_by(DC) %>% 
  summarize(login_count=n()) %>% 
  ggplot(aes(DC, login_count, fill=DC)) +geom_bar(stat="identity")+
  geom_hline(aes( yintercept = mean(login_count)))+xlab("Data Centers")+ylab("The number of remote sessions") +labs(title="Data center distribution of the remote sessions coming from the US")

p %>% filter(Country=="US" & TERMINATION_CAUSE==3) %>%group_by(DC) %>% 
  summarize(login_count=n()) %>% 
  ggplot(aes(DC, login_count, fill=DC)) +geom_bar(stat="identity")+
  geom_hline(aes( yintercept = mean(login_count)))+xlab("Data Centers")+ylab("The number of remote sessions") +labs(title="Data center distribution of the normal remote sessions coming from the US")

Obviously the US data center in Santa Clara is not fully used while a number of US remote sessions are sent to Cork data center regardless the termination cause. Connection delay is expected. Based on the business knowledge, the agent uses its configured timezone to decide which data center to access. It appears some configurations are wrong.

Data Analytic with linear model and regression

Check whether the number of remote session is in normal distribution.

## put your code here
x<-p %>%
  group_by(DAYOFYEAR) %>% 
 summarize(login_count=n())

fit<-x %>% lm(login_count ~ DAYOFYEAR, data = .) 

hist(x$login_count)

x %>%
  ggplot(aes(DAYOFYEAR, login_count)) + 
  geom_point() +
  geom_abline(intercept = fit$coef[1],
              slope = fit$coef[2])+xlab("Day of Year from 1/31/2016 to 4/30/2016")+ylab("Number of remote sessions")

sd(x$login_count)
## [1] 3229.688
qqnorm(x$login_count)
qqline(x$login_count)

mean(x$login_count) + c(-1,1)*qnorm(0.975)*sd(x$login_count)
## [1]  3307.862 15968.007

The Normal Q-Q plot and standard deviation show the bulk number of remote sessions with day of year does not fit a good normal distribution(big tails!).

From the previous observation, three criteria can be applied:

  1. The remote session count over the weekend is much lower than that in the weekday

  2. The remote session count from the US is much higher than that from other countries

  3. Majority of the remote session’s termination cause is 3 - abandoned

Check whether remote sessions are in normal distribution if these three variables are filtered.

x<-p %>%
  filter((WEEKDAY!="SAT" & WEEKDAY !='SUN')& TERMINATION_CAUSE==3) %>%
  group_by(DAYOFYEAR) %>% 
 summarize(login_count=n())

fit<-x %>% lm(login_count ~ DAYOFYEAR, data = .) 

hist(x$login_count)

x %>%
  ggplot(aes(DAYOFYEAR, login_count)) + 
  geom_point() +
  geom_abline(intercept = fit$coef[1],
              slope = fit$coef[2])+xlab("Day of Year from 1/31/2016 to 4/30/2016 except weekends")+ylab("Number of normal remote sessions")

sd(x$login_count)
## [1] 738.2565
qqnorm(x$login_count)
qqline(x$login_count)

mean(x$login_count) + c(-1,1)*qnorm(0.975)*sd(x$login_count)
## [1]  7134.213 10028.125
x<-p %>%
  filter(Country=="US" & (WEEKDAY!="SAT" & WEEKDAY !='SUN')& TERMINATION_CAUSE==3) %>%
  group_by(DAYOFYEAR) %>% 
 summarize(login_count=n())

fit<-x %>% lm(login_count ~ DAYOFYEAR, data = .) 

hist(x$login_count)

x %>%
  ggplot(aes(DAYOFYEAR, login_count)) + 
  geom_point() +
  geom_abline(intercept = fit$coef[1],
              slope = fit$coef[2])+xlab("Day of Year from 1/31/2016 to 4/30/2016 except weekends")+ylab("Number of normal remote sessions from US")

sd(x$login_count)
## [1] 374.5577
qqnorm(x$login_count)
qqline(x$login_count)

mean(x$login_count) + c(-1,1)*qnorm(0.975)*sd(x$login_count)
## [1] 3669.773 5138.012

Once the remote session is filtered with proper criteria, normal Q-Q plots and standard deviation confirm that the linear models fits the data well so that the trend can be correctly predicted.

Data Analytic with strata

Split the data into groups by data center considering the number of remote sessions, duration and data transfer. Filter the data property by considering geolocation, weekend/weekday and termination cause.

## put your code here

p <- p%>% filter(WEEKDAY!="SAT" & WEEKDAY !='SUN' & TERMINATION_CAUSE==3)

theme_set(theme_bw())

DataCenter<-c("Cork", "Hopkinton", "Durham","Santa Clara","Singapore")
p %>%
  group_by(DAYOFYEAR, DC) %>% 
 summarize(login_count=n(),DATA_TRANSFER=mean(DATA_TRANSFER/1000000)) %>%
  filter(DC %in% DataCenter) %>%
ggplot(aes(x = DAYOFYEAR, y = login_count, size=DATA_TRANSFER)) + 
  geom_point() + 
  geom_smooth(method = "lm") +
  facet_wrap(~DC, scales = "free")+xlab("Day of Year from 1/31/2016 to 4/30/2016")+ylab("Average number of remote sessions") +labs(title="Normal remote session trend from the US")

#Duration across Data center
p %>% group_by(DAYOFYEAR, DC) %>% 
 summarize(DURATION=mean(DURATION)) %>%
  filter(DC %in% DataCenter) %>%
ggplot(aes(x = DAYOFYEAR, y = DURATION)) + 
  geom_point() + 
  geom_smooth(method = "lm") +
  facet_wrap(~DC, scales = "free")+xlab("Day of Year from 1/31/2016 to 4/30/2016")+ylab("Duration of remote session in minutes") +labs(title="Duration of normal remote session trend from the US")

#Data Transfer across Data Center
p %>%
  group_by(DAYOFYEAR, DC) %>% 
 summarize(DATA_TRANSFER=mean(DATA_TRANSFER)/1000000) %>%
  filter(DC %in% DataCenter) %>%
ggplot(aes(x = DAYOFYEAR, y = DATA_TRANSFER)) + 
  geom_point() + 
  geom_smooth(method = "lm") +
  facet_wrap(~DC, scales = "free")+xlab("Day of Year from 1/31/2016 to 4/30/2016")+ylab("Data Transfer in MBs ") +labs(title="Data transfer of normal remote session from the US")

#Check the remote login coming from US across Data ceter
p %>% filter(Country=="US") %>%
  group_by(DAYOFYEAR, DC) %>% 
 summarize(login_count=n()) %>%
  filter(DC %in% DataCenter) %>%
ggplot(aes(x = DAYOFYEAR, y = login_count)) + 
  geom_point() + 
  geom_smooth(method = "lm") +
  facet_wrap(~DC, scales = "free")+xlab("Day of Year from 1/31/2016 to 4/30/2016")+ylab("Number of remote sessions") +labs(title="Normal remote session coming from the US")

#Each data center has three VMs 
VM <-c("esrgckprd01","esrgckprd02","esrgckprd03","esrghoprd01","esrghoprd02","esrghoprd03","esrgweprd01","esrgweprd02","esrgweprd03","esrgscprd01","esrgscprd02","esrgscprd03","esrgspprd01","esrgspprd02","esrgspprd03")

# check remote sessions across each VM
p %>% group_by(DAYOFYEAR, GAS_SERVER_NAME) %>% 
 summarize(login_count=n()) %>%
  filter(GAS_SERVER_NAME %in% VM) %>%
ggplot(aes(x = DAYOFYEAR, y = login_count)) + 
  geom_point() + 
  geom_smooth(method = "lm") +
  facet_wrap(~GAS_SERVER_NAME, scales = "free")+xlab("Day of Year from 1/31/2016 to 4/30/2016")+ylab("Average number of remote sessions") +labs(title="Cork:ckxxx,Hop:hoxxx,Durham:wexxx,Santa Clara:scxxx,Singapore:spxxx")

The number of normal remote session increases in Cork, Santa Clara and Singapore with data of year while it decreases in Hopkinton and Durham. The data center in Santa Clara and Singapore are not fully used.

Average duration of normal remote sessions in each data center increase with data of year. Average data transfer of normal remote session increases in every date center except for Santa Clara.

While the normal remote session requesting from the US increase in Santa Clara, it also increase in Cork, which means more and more agents from the US are not configured correctly and routed to Cork data center instead.

Conclusions

  1. The whole data set is not in perfect normal distribution. However once the remote session is filtered by the source(Country), by weekend data(because the remote session count is much lower than weekdays) and by proper termination cause(majority is from abandoned cause), the filtered data is in normal distribution and confirmed by Normal Q-Q plot.

  2. Cork’s data center was overloaded compared to other data centers obsessed with abnormal termination cause remote session, where the request from Germany contributed the most. Santa Clara and Singapore data centers are under its full capacity in terms of the number of remote sessions.

  3. A number of agents, because their timezone settings are not configured correctly, are routing requests to the data centers far away from their geolocation. For example, a number of remote sessions coming from the US is routed to Cork or Singapore, even though the data center in Santa Clara has enough capacity.

  4. The general trend of the number of remote sessions reduces gradually with date of year The number of remote sessions over the weened is much lower than weekdays. The average duration and data transfer of remote sessions to each data center rises consistently.

  5. The number of remote sessions requesting from the US is much higher than that of any other country.